If you have been following along with my posts, you may have noticed that one thing I haven't spent much time on is time series and subsequent forecasting. I have dealt with sequences (both via Recurrent Neural Networks and Markov Models), but given the vast amount of time series data that you can encounter in industry, this post is long overdue.
Before digging into the theory, I think it is most beneficial to start from the top and work our way down. Namely, I'd like to get everyone exposed to actual interaction with time series data through the pandas library, and then we will move on to specific forecasting techniques and dive into the mathematics. The reason for this is that, in my experience, getting time series data into the correct format and manipulating it as needed is a bit more challenging than traditional ML preprocessing (especially if you have never worked with it before). With that said, let's get started!
from datetime import datetime
# Date information
year = 2020
month = 1
day = 2
# Time information
hour = 13
mins = 30
sec = 15
date = datetime(year, month, day, hour, mins, sec)
print(date)
Now, while Python does have the built-in ability to handle datetimes, numpy is more efficient when it comes to handling dates. The numpy data type for datetimes is datetime64. It will have a different type compared to Python's built-in datetime:
import numpy as np
np_date = np.array(["2020-03-15"], dtype="datetime64")
print(f"Python's datetime type: {type(date)}\n", "Numpy datetime type: ", np_date.dtype)
We can take this one step further and actually create numpy date ranges. By using np.arange() we can create a date range easily as follows:
display(np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")) # Where the dtype specifies our step type
Pandas handles datetimes in a way that is built on top of numpy. The API to create a date range is shown below:
import pandas as pd
display(pd.date_range("2020-01-01", periods=7, freq="D")) # String code D stands for days
The return value is a pandas DatetimeIndex which is a specialized pandas index built for datetimes. In other words, it is not just normal string codes, rather it is aware that they are datetime64 objects. Pandas is able to handle a variety of string codes, however, we will stick with the standard year-month-day.
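To illustrate a few of those other string codes, here is a quick sketch using some common frequency aliases ("W" for weekly, "MS" for month start, "H" for hourly):

```python
import pandas as pd

# Weekly frequency: snaps to week-end (Sunday) boundaries by default
weekly = pd.date_range("2020-01-01", periods=4, freq="W")
# Month-start frequency: the first day of each month
month_starts = pd.date_range("2020-01-01", periods=3, freq="MS")
# Hourly frequency
hourly = pd.date_range("2020-01-01", periods=3, freq="H")
print(weekly)
print(month_starts)
print(hourly)
```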
A nice helper method that pandas offers is the to_datetime method:
display(pd.to_datetime(["1/2/2018"]))
Which again returns a DatetimeIndex. An interesting thing to note is that if we don't pass in a list above, we receive a Timestamp object instead:
display(pd.to_datetime("1/2/2018"))
Another common bit of preprocessing that you will most certainly come across is receiving date times in a format that is not expected/the default date time format. For instance, imagine that you are being sent a series of date times from an external database and the format is:
"2018--1--2"
Well, thankfully pandas to_datetime has a format key word argument that can be used as follows:
display(pd.to_datetime(["2018--1--2"], format="%Y--%m--%d"))
display(pd.to_datetime(["2018/-1/-2"], format="%Y/-%m/-%d"))
Finally, we can create a pandas dataframe with a date time index:
idx = pd.date_range("2020-01-01", periods=3, freq="D")
cols = ["A", "B"]
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=cols)
display(df)
And we can see that our index is indeed comprised of datetime objects:
display(df.index)
And also perform operations such as:
display(df.index.max())
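Beyond max, a DatetimeIndex also supports min and positional lookups via argmin/argmax (which return integer positions rather than timestamps). A small self-contained sketch:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="D")
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=["A", "B"])

print(df.index.min())     # earliest timestamp in the index
print(df.index.argmax())  # integer position of the latest timestamp
```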
Now that we have an idea of how to deal with basic time series object creation in native Python, numpy, and pandas, we can start performing more specific time series manipulations. To start, we can look at time resampling. This operates similarly to a groupby operation, except we end up aggregating based on some sort of time frequency.
For example, we could take daily data and resample it into monthly data (by taking the average, or some other sort of aggregation). Let's look into this further with a real data set: a csv of Starbucks stock prices by date. We will read in our csv with a date index that is a datetime and not a string:
import boto3
s3 = boto3.client('s3')
bucket = "intuitiveml-data-sets"
key = "starbucks.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)
display(df.head())
Seeing that our index is in fact a date time:
display(df.index)
We can now perform a basic resampling as follows (the rule "A" denotes year-end frequency; the full list of offset aliases is in the pandas documentation):
# daily ---> yearly
df_resampled = df["Close"].resample(rule="A").mean()
display(df_resampled)
A very cool feature is that we can even implement our own resampling functions (if mean, max, min, etc. do not provide the necessary functionality):
def first_day(entry):
    if len(entry):
        return entry[0]
df_resampled = df["Close"].resample(rule="A").apply(first_day)
display(df_resampled)
We can of course combine this resampling with some basic plotting. Below, we can see the average closing price per year:
import matplotlib.pyplot as plt
import cufflinks
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.display import display, HTML  # needed for display(HTML(...)) below
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
trace1 = go.Bar(
x=df_resampled.index,
y=df_resampled.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=500,
height=400,
title="Yearly Mean Closing Price for Starbucks",
xaxis=dict(title="Date"),
yaxis=dict(title="Mean Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can of course perform the same sort of resampling at a monthly frequency as well:
df_resampled = df["Close"].resample(rule="M").max()
display(df_resampled.head())
# df_resampled.iplot(
# kind="bar",
# color="green",
# title="Monthly max Closing Price for Starbucks",
# xTitle='Date',
# yTitle='Mean Closing Price',
# dimensions=(650,350)
# )
trace1 = go.Bar(
x=df_resampled.index,
y=df_resampled.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=500,
height=400,
title="Monthly Max Closing Price for Starbucks",
xaxis=dict(title="Date"),
yaxis=dict(title="Max Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
Sometimes when working with time series data, you may need to shift it all up or down along the time series index. Pandas has built in methods that can easily accomplish this. Recall the head of our Starbucks df:
display(df.head())
If we shift our rows by a single row we end up with the following:
display(df.shift(1).head())
We can also shift based on frequency codes. For instance, we can shift everything forward one month:
display(df.shift(periods=1, freq="M").head())
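A common use of shift is computing one-step differences: subtracting the shifted series from the original gives the day-over-day change. A minimal sketch on made-up data (the values here are hypothetical):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 15.0],
              index=pd.date_range("2020-01-01", periods=4, freq="D"))

# Subtracting the shifted series gives the one-step difference...
diff_via_shift = s - s.shift(1)
# ...which is exactly what Series.diff() computes
assert diff_via_shift.equals(s.diff())
print(diff_via_shift)  # NaN, 2.0, -1.0, 4.0
```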
Let's now take a minute to go over rolling and expanding our time series data with pandas. The basic premise is that a common process when working with time series data is to create features based on a rolling mean. What we can do is divide the data into windows of time, then calculate an aggregate for each moving window. In this way we will have calculated a simple moving average.
Recall that our closing price data looks like:
trace1 = go.Scatter(
x=df.index,
y=df.Close.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=650,
height=350,
title="Closing Price for Starbucks",
xaxis=dict(title="Date"),
yaxis=dict(title="Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
What we are going to do is add in a rolling mean. A rolling mean simply creates a little window, say 7 days, looks at each section of 7 days, and performs some sort of aggregate function on it. In this case it will be a mean, or average. So, we take the mean of each 7 day section and keep rolling that window along.
trace1 = go.Scatter(
x=df.index,
y=df.Close.rolling(window=7).mean().values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=650,
height=350,
title="Rolling Average",
xaxis=dict(title="Date"),
yaxis=dict(title="Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
To be sure that this is clear, look at the first 10 rows of our dataframe:
df["Close"].head(10)
Now, the rolling averages are found as follows:
for i in range(0, 4):
    print(f"Window {i+1}, the average of rows {i}:{i+7} ->", df["Close"][i:i+7].mean())
Which we can see is equivalent to the rolling average found via pandas:
df["Close"].rolling(
window=7
).mean()[6:10]
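To verify this equivalence on data you can reproduce without the csv, here is a small synthetic check (the series values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))  # 0.0, 1.0, ..., 9.0
rolled = s.rolling(window=7).mean()

# The first 6 entries have incomplete windows, so they are NaN
assert rolled[:6].isna().all()
# Entry 6 is the mean of rows 0..6, entry 7 the mean of rows 1..7, and so on
for i in range(6, 10):
    assert np.isclose(rolled[i], s[i - 6 : i + 1].mean())
print(rolled[6:])  # 3.0, 4.0, 5.0, 6.0
```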
Let's now overlay our original closing price with the rolling average:
df_rolling_window = df["Close"].rolling(
window=7
).mean()
trace1 = go.Scatter(
x = df_rolling_window.index,
y = df_rolling_window.values,
mode="lines",
marker = dict(
size = 6,
color = 'orange',
),
name="Rolling Mean, Window = 7 days"
)
trace2 = go.Scatter(
x = df["Close"].index,
y = df["Close"].values,
mode="lines",
marker = dict(
size = 6,
color = 'blue',
),
name="Original"
)
data = [trace2, trace1]
layout=go.Layout(
title="Rolling Average (7 day window) vs. No transformation Starbucks Closing Price",
width=950,
height=500,
xaxis=dict(title="Date"),
yaxis=dict(title='Closing Price'),
legend=dict(x=0.05, y=1)
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can of course increase our window size, and we will subsequently see more smoothing:
df_rolling_window = df["Close"].rolling(
window=30
).mean()
trace1 = go.Scatter(
x = df_rolling_window.index,
y = df_rolling_window.values,
mode="lines",
marker = dict(
size = 6,
color = 'orange',
),
name="Rolling Mean, Window = 30 days"
)
trace2 = go.Scatter(
x = df["Close"].index,
y = df["Close"].values,
mode="lines",
marker = dict(
size = 6,
color = 'blue',
),
name="Original"
)
data = [trace2, trace1]
layout=go.Layout(
title="Rolling Average (30 day window) vs. No transformation Starbucks Closing Price",
width=950,
height=500,
xaxis=dict(title="Date"),
yaxis=dict(title='Closing Price'),
legend=dict(x=0.05, y=1)
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
As we continue increasing the window size, we can see that we are viewing a more general trend. Now, in addition to rolling windows we can also work with expanding windows. For instance, what if we wanted to take into account everything from the start of the time series up to each point in time (i.e. a cumulative average)? This would work as follows:
df_expanding = df["Close"].expanding().mean()
trace1 = go.Scatter(
x=df_expanding.index,
y=df_expanding.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=650,
height=350,
title="Expanding Closing Price Average",
xaxis=dict(title="Date"),
yaxis=dict(title="Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can see that this curve is much smoother than the raw series: since each point averages all of the data up to that time, fluctuations get damped out as the window grows.
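To make the expanding computation concrete: the value at position n of expanding().mean() is just the average of the first n+1 observations. A quick synthetic check (values made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, 2.0, 6.0, 8.0])
expanded = s.expanding().mean()

# Each entry averages everything up to and including that point,
# i.e. the running sum divided by the running count
manual = s.cumsum() / np.arange(1, len(s) + 1)
assert np.allclose(expanded, manual)
print(expanded)  # 4.0, 3.0, 4.0, 5.0
```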
I want us to now move on to learning about the main tool we will use in time series forecasting: the statsmodels library. Statsmodels can be thought of as follows:
A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
Keep in mind that we won't really be doing any forecasting yet. Rather, we will be familiarizing ourselves with the statsmodels library and some of the statistical tests that you can perform on time series data.
Let's look at some basic properties of time series data. To begin, we have trends. Time series can have trends, as seen below:
from plotly import tools
x = np.arange(0,50,0.01)
y_stationary = np.sin(x)
y_upward = x*0.1 + np.sin(x)
y_downward = np.sin(x) - x*0.1
trace1 = go.Scatter(
x=x,
y=y_stationary
)
trace2 = go.Scatter(
x=x,
y=y_upward,
xaxis='x2',
yaxis='y2'
)
trace3 = go.Scatter(
x=x,
y=y_downward,
xaxis='x3',
yaxis='y3'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
showlegend=False
)
fig = tools.make_subplots(
rows=1,
cols=3,
subplot_titles=("Stationary", "Upward", "Downward"),
print_grid=False
)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig['layout']['yaxis1'].update(range=[-3, 3])
fig['layout'].update(
showlegend=False,
height=300
)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
Above, we can see stationary, upward, and downward trends. Most real time series will exhibit some combination of these. Additionally, time series can exhibit seasonality, a repeating trend:
s3 = boto3.client('s3')
bucket = "intuitiveml-data-sets"
key = "monthly_milk_production.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)
clipped_df = df["1962-01-01":"1968-01-01"]
trace1 = go.Scatter(
x=clipped_df.index,
y=clipped_df["Production"].values,
marker = dict(
size = 6,
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=800,
height=400,
title="Seasonality and Upward Trend",
xaxis=dict(title="Date")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can clearly see in the plot above that there is seasonality associated with the data: at around the 3rd month of each year we observe a peak, and this pattern repeats every cycle. On top of the seasonality, the overall trend in milk production is upward.
Finally, we also have cyclical components. Cyclical components are trends that have no set repetition. In the plot below, there do appear to be trends, but they do not occur on a regular cycle.
bucket = "intuitiveml-data-sets"
key = "starbucks.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)
trace1 = go.Scatter(
x=df.index,
y=df["Close"].values,
marker = dict(
size = 6,
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=800,
height=400,
title="Cyclical and Upward Trend",
xaxis=dict(title="Date")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))